Toponym Disambiguation in Natural Language Processing

Author

  • Davide Buscaldi
Abstract

In recent years, geography has acquired great importance in Information Retrieval (IR) and, more generally, in the automated processing of information in text. Mobile devices that can browse the web and report their position at the same time are now commonplace, together with applications that exploit this data to provide users with locally customised information, such as directions or advertisements. It is therefore important to deal properly with the geographic information included in electronic texts. Most of this information takes the form of place names, or toponyms. The objective of this Ph.D. thesis is to study toponym ambiguity and the effects of its resolution in applications such as Geographical Information Retrieval (GIR), Question Answering (QA) and Web Retrieval. In GIR, toponym ambiguity is an important issue because queries are geographically constrained. It has proved difficult to find specifically geographical IR methods that actually outperform traditional IR techniques. Toponym ambiguity may be a relevant factor in the inability of current GIR systems to take advantage of geographical knowledge. The work presented in this thesis starts with an introduction to the applications in which Toponym Disambiguation (TD) may prove useful, together with an analysis of the ambiguity of toponyms in news collections. It would not be possible to study toponym ambiguity without studying the resources used as place name repositories; these resources are the equivalent of language dictionaries, which provide the different meanings of a given word. It will be shown that the choice of a particular toponym repository should follow from an analysis of the task to be carried out or of the specific kind of application to be developed. The choice of a proper Toponym Disambiguation method is also key.
In this work, two methods, a knowledge-based method and a map-based method, were developed and compared over the same test set. A case study of the application of TD methods to a corpus of Italian news is presented. The effects of the choice of a particular toponym resource and method in GIR have been studied, showing that TD may prove useful if queries are short and a detailed resource is used. The level of disambiguation errors was found not to be relevant, even when errors affect 60% of the toponyms in the collection, if the resource used has little coverage and detail. Ranking methods that sort the results on the basis of geographical criteria were observed to be more sensitive to whether TD is used, especially with a detailed resource. The disambiguation of toponyms was observed not to be an issue in Question Answering, because errors in TD are usually less important than other kinds of errors in QA. In GIR, the geographical constraints contained in most queries are area constraints, so that the information need of users can be summarised in the form “X in P”, where P is a place name and X represents the thematic part of the query. A frequent issue arises when the place name cannot be found in any resource because it is a fuzzy region or a vernacular name. To overcome this issue, Geooreka!, a prototype search engine with a map-based interface, was developed. Preliminary testing of this system is presented in this work. The work carried out on this search engine showed that Toponym Disambiguation can be particularly useful on web documents, especially for applications like Geooreka! that need to estimate the occurrence probabilities for places.
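
The abstract does not describe how the map-based method works, so the following Python sketch only illustrates the general idea commonly associated with map-based disambiguation: pick the candidate referent that lies closest to the other places mentioned in the same text. It is not the method developed in the thesis. The gazetteer contents, the coordinates, and the names `GAZETTEER`, `haversine_km` and `disambiguate` are all assumptions made for illustration.

```python
import math

# Hypothetical mini-gazetteer: toponym -> list of (label, latitude, longitude).
# Coordinates are approximate and only serve the example.
GAZETTEER = {
    "Cambridge": [("Cambridge, UK", 52.205, 0.119),
                  ("Cambridge, MA, USA", 42.374, -71.106)],
    "Boston": [("Boston, MA, USA", 42.360, -71.059)],
    "Salem": [("Salem, MA, USA", 42.519, -70.897)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def disambiguate(toponym, context):
    """Return the label of the candidate closest to the context centroid.

    Only context toponyms with a single gazetteer entry are used as
    anchors. The centroid is a plain average of latitudes and longitudes,
    a simplification that breaks near the antimeridian and the poles.
    """
    candidates = GAZETTEER.get(toponym, [])
    if not candidates:
        return None  # toponym not covered by the resource
    anchors = [GAZETTEER[t][0][1:] for t in context
               if len(GAZETTEER.get(t, [])) == 1]
    if not anchors:
        return candidates[0][0]  # no evidence: fall back to the first entry
    centroid_lat = sum(lat for lat, _ in anchors) / len(anchors)
    centroid_lon = sum(lon for _, lon in anchors) / len(anchors)
    best = min(candidates,
               key=lambda c: haversine_km(c[1], c[2], centroid_lat, centroid_lon))
    return best[0]


if __name__ == "__main__":
    print(disambiguate("Cambridge", ["Boston", "Salem"]))
```

With "Boston" and "Salem" as context, the sketch selects the Massachusetts referent for "Cambridge". A gazetteer with little coverage would return `None` or the fallback entry more often, which is one way to see why the abstract ties the usefulness of TD to the detail of the resource.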


Similar resources

Toponym Extraction and Disambiguation Enhancement using Loops of Feedback

Toponym extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. This paper addresses two problems with toponym extraction and disambiguation. First, almost no existing works examine the extraction and disambiguation interdependency. Second, existing disambiguation...


Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction

Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. This paper addresses two problems with toponym extraction and disambiguation (as a representative example of named entities). First, almost no existing works examine the extraction an...


Toponym Disambiguation by Arborescent Relationships

Problem statement: The way of referring to a place in the geographical space can be formal, based on the spatial coordinates, or informal, which we use in natural language by using toponyms (place names). A toponym can represent several geographical places. This ambiguity made problematic its conversion towards a unique formal representation. Toponym disambiguation in text is the task of assign...


Resolving fine granularity toponyms: Evaluation of a disambiguation approach

Landscape descriptions in natural language, for instance from historic corpora, are a complementary source to empirical ethnographic work, for example to research exploring variation in the use of basic levels or basic terms within landscapes across localities (c.f. Mark and Turk 2003, Burenhult and Levinson 2008, Turk et al. 2011), on the condition that such descriptions can be linked to space...


Disambiguating Geographic Names in a Historical Digital Library

Geographic interfaces provide natural, scalable visualizations for many digital library collections, but the wide range of data in digital libraries presents some particular problems for identifying and disambiguating place names. We describe the toponym-disambiguation system in the Perseus digital library and evaluate its performance. Name categorization varies significantly among different ty...


A Hybrid Approach for Robust Multilingual Toponym Extraction and Disambiguation

Toponym extraction and disambiguation are key topics recently addressed by fields of Information Extraction and Geographical Information Retrieval. Toponym extraction and disambiguation are highly dependent processes. Not only toponym extraction effectiveness affects disambiguation, but also disambiguation results may help improving extraction accuracy. In this paper we propose a hybrid toponym...




Publication date: 2009